光学相干断层扫描(OCT)是一种非侵入性技术,可在微米分辨率中捕获视网膜的横截面区域。它已被广泛用作辅助成像参考,以检测与眼睛有关的病理学并预测疾病特征的纵向进展。视网膜层分割是至关重要的特征提取技术之一,其中视网膜层厚度的变化和由于液体的存在而引起的视网膜层变形高度相关,与多种流行性眼部疾病(如糖尿病性视网膜病)和年龄相关的黄斑疾病高度相关。变性(AMD)。但是,这些图像是从具有不同强度分布或换句话说的不同设备中获取的,属于不同的成像域。本文提出了一种分割引导的域适应方法,以将来自多个设备的图像调整为单个图像域,其中可用的最先进的预训练模型可用。它避免了即将推出的新数据集的手动标签的时间消耗以及现有网络的重新培训。网络的语义一致性和全球特征一致性将最大程度地减少许多研究人员报告的幻觉效果,这些效应对周期矛盾的生成对抗网络(Cyclegan)体系结构。
translated by 谷歌翻译
对少量语义分割(FSS)的研究引起了极大的关注,目的是在查询图像中仅给出目标类别的少数注释的支持图像。这项具有挑战性的任务的关键是通过利用查询和支持图像之间的细粒度相关性来充分利用支持图像中的信息。但是,大多数现有方法要么将支持信息压缩为几个班级原型,要么在像素级别上使用的部分支持信息(例如,唯一的前景),从而导致不可忽略的信息损失。在本文中,我们提出了密集的像素,互源和支持的注意力加权面膜聚合(DCAMA),其中前景和背景支持信息都是通过配对查询和支持特征之间的多级像素的相关性通过多级像素的相关性充分利用的。 DCAMA在变压器体系结构中以缩放点产生的关注实现,将每个查询像素视为令牌,计算其与所有支持像素的相似之处,并预测其分割标签是所有支持像素标签的添加剂聚集 - 相似之处。基于DCAMA的唯一公式,我们进一步提出了对N-shot分割的有效有效的一通推断,其中所有支持图像的像素立即为掩模聚集收集。实验表明,我们的DCAMA在Pascal-5i,Coco-20i和FSS-1000的标准FSS基准上显着提高了最先进的状态以前的最佳记录。烧烤研究还验证了设计dcama。
translated by 谷歌翻译
最近,已广泛研究了基于深度学习的方法,以进行可变形的图像注册任务。但是,大多数努力将复合图像表示形式直接映射到通过卷积神经网络的空间转换,而忽略了其捕获空间对应关系的有限能力。另一方面,变压器可以更好地表征与注意机制的空间关系,其远程依赖性可能对注册任务有害,在这种情况下,距离太大的体素不太可能是相应的对。在这项研究中,我们提出了一个新型的变形器模块,以及用于可变形图像配准任务的多尺度框架。变形器模块旨在通过将位移矢量预测作为几个碱基的加权总和来促进从图像表示到空间转换的映射。借助多尺度框架以粗略的方式预测位移字段,与传统和基于学习的方法相比,可以实现卓越的性能。进行了两个公共数据集的全面实验,以证明所提出的变形器模块以及多规模框架的有效性。
translated by 谷歌翻译
In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.
translated by 谷歌翻译
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors, which can not be extended to novel domains and classes. To tackle these limitations, we introduce embedding learned from Contrastive Language-Image Pre-training (CLIP) to segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve the state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
translated by 谷歌翻译
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, whose goal is to preserve invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool in aiding image understanding but has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.
translated by 谷歌翻译
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
translated by 谷歌翻译
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
translated by 谷歌翻译
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
translated by 谷歌翻译
With the development of technology and sharing economy, Airbnb as a famous short-term rental platform, has become the first choice for many young people to select. The issue of Airbnb's pricing has always been a problem worth studying. While the previous studies achieve promising results, there are exists deficiencies to solve. Such as, (1) the feature attributes of rental are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies on predicting the rental price combined with the point of interest(POI) around the house. To address the above challenges, we proposes a multi-source information embedding(MSIE) model to predict the rental price of Airbnb. Specifically, we first selects the statistical feature to embed the original rental data. Secondly, we generates the word feature vector and emotional score combination of three different text information to form the text feature embedding. Thirdly, we uses the points of interest(POI) around the rental house information generates a variety of spatial network graphs, and learns the embedding of the network to obtain the spatial feature embedding. Finally, this paper combines the three modules into multi source rental representations, and uses the constructed fully connected neural network to predict the price. The analysis of the experimental results shows the effectiveness of our proposed model.
translated by 谷歌翻译